A PERFORMANCE OPTIMIZED APPROACH 
FOR EFFICIENT DOWNSAMPLING OPERATIONS 


FIELD OF THE INVENTION 

The present invention relates to the areas of computation and algorithms, and 
specifically to digital signal processing ("DSP") and to digital logic, algorithms, 
structures and systems for performing DSP operations. In particular, the 
present invention relates to a reconfigurable system for providing non-downsampling 
and downsampling operations on a signal. 

BACKGROUND INFORMATION 

Digital signal processing ("DSP") and information theory technology are 
essential to modern information processing and telecommunications for both the 
efficient storage and transmission of data. In particular, effective multimedia 
communications, including speech, audio and video, rely on efficient methods and 
structures for compression of the multimedia data in order to conserve bandwidth on 
the transmission channel as well as to reduce storage requirements. 

Many DSP algorithms rely on transform kernels such as the FFT ("Fast Fourier 
Transform") and the DCT ("Discrete Cosine Transform"). For example, the DCT has 
become a very widely used component in the compression of multimedia 
information, in particular video information. The DCT is a lossless mathematical 
transformation that converts a spatial or time representation of a signal into a 
frequency representation. The DCT offers attractive properties for converting 
between spatial/time domain and frequency representations of signals as opposed 
to other transforms such as the DFT ("Discrete Fourier Transform")/FFT. In 
particular, the kernel of the transform is real, reducing the complexity of the 
processor calculations that must be performed. In addition, a significant 
advantage of the DCT for compression is that it exhibits an energy compaction 
property, wherein the signal energy in the transform domain is concentrated in 
low frequency components, while higher frequency components are typically much 
smaller in magnitude and may often be discarded. The DCT is in fact asymptotic 
to the statistically optimal Karhunen-Loeve transform ("KLT") for Markov signals 
of all orders. Since its introduction in 1974, the DCT has been used in many 
applications such as filtering, transmultiplexers, speech coding, image coding 
(still frame, video and image storage), pattern recognition, image enhancement 
and SAR/IR image coding. The DCT has played an important role in commercial 
applications involving DSP; most notably, it has been adopted by MPEG ("Moving 
Picture Experts Group") for use in the MPEG 2 and MPEG 4 video compression 
algorithms. 

A computation that is common in digital filters such as finite impulse response 
("FIR") filters or linear transformations such as the DFT and DCT may be expressed 
mathematically by the following dot-product equation: 

$$d = \sum_{i=0}^{N-1} a(i) * b(i)$$

where a(i) are the input data, b(i) are the filter coefficients (taps) and d is the output. 
Typically a multiply-accumulator ("MAC") is employed in traditional DSP design in 
order to accelerate this type of computation. A MAC kernel can be described by the 
following equation: 

$$d^{[i+1]} = d^{[i]} + a(i) * b(i)$$ with initial value $d^{[0]} = 0$. 

In some cases it is advantageous to downsample a signal. For example, with 
images, it is often advantageous to view an image in a smaller frame. However, the 
algorithms for generating a downsampled signal and a non-downsampled signal 
typically differ significantly. Thus, separate hardware structures are typically 
required to generate either a downsampled signal or a non-downsampled 
signal. This is highly disadvantageous, as it results in increased hardware area, 
complexity and cost. Thus, it would be advantageous to develop a hardware structure 
capable of operating in either a downsampling or a non-downsampling mode, while 
reducing the redundancy of hardware elements as much as possible. 

BRIEF DESCRIPTION OF THE DRAWINGS 

FIG. 1a is a block diagram of a video encoding system. 

FIG. 1b is a block diagram of a video decoding system. 

FIG. 2 is a block diagram of a datapath for computing a 2-D IDCT. 

FIG. 3 is a block diagram illustrating the operation of a MAC kernel. 

FIG. 4 is a block diagram illustrating the operation of a MAAC kernel 
according to one embodiment of the present invention. 

FIG. 5 illustrates a paradigm for improving computational processes utilizing 
a MAAC kernel according to one embodiment of the present invention. 

FIG. 6 is a block diagram of a hardware architecture for computing an 
eight-point IDCT utilizing a MAAC kernel according to one embodiment of the present 
invention. 

FIG. 7 is a block diagram of a datapath for computation of an 8-point IDCT 
utilizing the method of the present invention and a number of MAAC kernel 
components according to one embodiment of the present invention. 

FIG. 8 is a block diagram illustrating the operation of an AMAAC kernel 
according to one embodiment of the present invention. 

FIG. 9 is a block diagram of a reconfigurable downsampling computation 
engine according to one embodiment of the present invention. 

FIG. 10 is a block diagram of a hardware architecture for computing an 
eight-point IDCT in a 2:1 downsampling mode according to one embodiment of the present 
invention. 

FIG. 11a is a block diagram illustrating a datapath for computing a first path 
of an eight-point 2-D IDCT in a non-downsampling mode according to one 
embodiment of the present invention. 

FIG. 11b is a block diagram illustrating a datapath for computing a second 
path of an eight-point 2-D IDCT in a non-downsampling mode according to one 
embodiment of the present invention. 

FIG. 12a is a block diagram illustrating a datapath for computing a first path 
of an eight-point to four-point 2-D IDCT in a downsampling mode according to one 
embodiment of the present invention. 

FIG. 12b is a block diagram illustrating a datapath for computing a second 
path of an eight-point to four-point 2-D IDCT in a downsampling mode according to 
one embodiment of the present invention. 

DETAILED DESCRIPTION 

The present invention provides an algorithm and hardware structure for 
numerical operations on signals that is reconfigurable to operate in a downsampling or 
non-downsampling mode. According to one embodiment, a plurality of adders and 
multipliers are reconfigurable via a switching fabric to operate as a plurality of 
MAAC kernels (described in detail below), when operating in a non-downsampling 
mode, and as a plurality of MAAC kernels and AMAAC kernels (described in detail 
below), when operating in a downsampling mode. According to one embodiment, the 
downsampling and non-downsampling operations are performed as part of an IDCT 
process. 

FIG. 1a is a block diagram of a video encoding system. Video encoding 
system 123 includes DCT block 111, quantization block 113, inverse quantization 
block 115, IDCT block 140, motion compensation block 150, frame memory block 
160, motion estimation block 121 and VLC ("Variable Length Coder") block 131. Input 
video is received in digitized form. Together with one or more reference video data 
from frame memory, input video is provided to motion estimation block 121, where a 
motion estimation process is performed. The output of motion estimation block 121, 
containing motion information such as motion vectors, is transferred to motion 
compensation block 150 and VLC block 131. Using the motion vectors and one or more 
reference video data, motion compensation block 150 performs a motion compensation 
process to generate motion prediction results. The motion prediction results from 
motion compensation block 150 are subtracted from the input video at adder 170a. 

The output of adder 170a is provided to DCT block 111, where a DCT is computed. 
The output of the DCT is provided to quantization block 113, where the frequency 
coefficients are quantized and then transmitted to VLC block 131, where a variable 
length coding process (e.g., Huffman coding) is performed. Motion information from 
motion estimation block 121 and quantized indices of DCT coefficients from 
quantization block 113 are provided to VLC block 131. The output of VLC block 131 
is the compressed video data output from video encoder 123. The output of 
quantization block 113 is also transmitted to inverse quantization block 115, where 
an inverse quantization process is performed. 

The output of inverse quantization block 115 is provided to IDCT block 140, 
where an IDCT is performed. The output of IDCT block 140 is summed at adder 170b 
with the motion prediction results from motion compensation block 150. The output 
of adder 170b is reconstructed video data and is stored in frame memory block 160 
to serve as reference data for the encoding of future video data. 

FIG. 1b is a block diagram of a video decoding system. Video decoding 
system 125 includes variable length decoder ("VLD") block 110, inverse scan ("IS") 
block 120, inverse quantization ("IQ") block 130, IDCT block 140, frame memory 
block 160, motion compensation block 150 and adder 170. A compressed video 
bitstream is received by VLD block 110 and decoded. The decoded symbols are converted 
into quantized indices of DCT coefficients and their associated sequential locations in 
a particular scanning order. The sequential locations are then converted into 
frequency-domain locations by IS block 120. The quantized indices of DCT 
coefficients are converted to DCT coefficients by IQ block 130. The DCT 
coefficients are received by IDCT block 140 and transformed. The output from the 
IDCT is then combined with the output of motion compensation block 150 by 
adder 170. Motion compensation block 150 may reconstruct individual pictures 
based upon the changes from one picture to its reference picture(s). Data from the 
reference picture(s), a previous one or a future one or both, may be stored in a 
temporary frame memory block 160 such as a frame buffer and may be used as the 
references. Motion compensation block 150 uses the motion vectors decoded 
by VLD block 110 to determine how the current picture in the sequence changes from 
the reference picture(s). The output of motion compensation block 150 is the 
motion prediction data. The motion prediction data is added to the output of IDCT 
block 140 by adder 170. The output from adder 170 is then clipped (not shown) to 
become the reconstructed video data. 

FIG. 2 is a block diagram of a datapath for computing a 2-dimensional (2-D) 
IDCT according to one embodiment of the present invention. It includes a data 
multiplexer 205, a 1-D IDCT block 210, a data demultiplexer 207 and a transport 
storage unit 220. Incoming data from IQ is processed in two passes through the IDCT. 
In the first pass, IDCT block 210 is configured to perform a 1-D IDCT along the 
vertical direction. In this pass, data from IQ is selected by multiplexer 205 and 
processed by 1-D IDCT block 210. The output from IDCT block 210 is a set of 
intermediate results, which are routed by demultiplexer 207 to be stored in transport 
storage unit 220. In the second pass, IDCT block 210 is configured to perform a 1-D 
IDCT along the horizontal direction. As such, the intermediate data stored in transport 
storage unit 220 is selected by multiplexer 205 and processed by 1-D IDCT block 
210. Demultiplexer 207 outputs the results from the 1-D IDCT block as the final result of 
the 2-D IDCT. 
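
The two-pass flow of FIG. 2 can be illustrated with a short software sketch. This is only a model under stated assumptions: the function names (`idct_1d`, `transpose`, `idct_2d`) are hypothetical, the 1-D IDCT uses the common orthonormal scaling, and a plain list of lists stands in for storage unit 220.

```python
import math

def idct_1d(y):
    """Direct N-point 1-D IDCT (orthonormal scaling is an assumption)."""
    n = len(y)
    out = []
    for i in range(n):
        acc = 0.0
        for k in range(n):
            a = 1 / math.sqrt(2) if k == 0 else 1.0
            acc += a * y[k] * math.cos((2 * i + 1) * k * math.pi / (2 * n))
        out.append(math.sqrt(2 / n) * acc)
    return out

def transpose(block):
    """Swap rows and columns of a square block."""
    return [list(col) for col in zip(*block)]

def idct_2d(coeffs):
    # First pass: 1-D IDCT along the vertical direction (one column at a time).
    first_pass = [idct_1d(col) for col in transpose(coeffs)]
    # The intermediate results are held column by column; reading them back
    # row by row models the storage unit between the two passes.
    intermediate = transpose(first_pass)
    # Second pass: 1-D IDCT along the horizontal direction (one row at a time).
    return [idct_1d(row) for row in intermediate]

# Usage: a block holding only a DC coefficient decodes to a flat block.
block = [[0.0] * 8 for _ in range(8)]
block[0][0] = 8.0
flat = idct_2d(block)
```

Reusing one 1-D routine for both passes mirrors the single IDCT block 210 being reconfigured between the vertical and horizontal directions.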

Many computational processes such as the transforms described above (i.e., 
DCT, IDCT, DFT, etc.) and filtering operations rely upon a multiply and accumulate 
kernel. That is, the algorithms are effectively performed utilizing one or more 
multiply and accumulate components typically implemented as specialized hardware 
on a DSP or other computer chip. The commonality of the MAC nature of these 
processes has resulted in the development of particular digital logic and circuit 
structures to carry out multiply and accumulate processes. In particular, a 
fundamental component of any DSP chip today is the MAC unit. 

FIG. 3 is a block diagram illustrating the operation of a MAC kernel. 
Multiplier 310 performs multiplication of input datum a(i) and filter coefficient b(i), 
the result of which is passed to adder 320. Adder 320 adds the result of multiplier 310 
to accumulated output $d^{[i]}$, which was previously stored in register 330. The output of 
adder 320 ($d^{[i+1]}$) is then stored in register 330. Typically a MAC output is generated 
on each clock cycle. 
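
As an illustration only, the MAC recurrence can be modeled in a few lines of software. The function name `mac` and the returned cycle count are assumptions made for this sketch; the comments map each step to the elements of FIG. 3.

```python
def mac(a, b):
    """Model of a MAC kernel: one multiply and one accumulate per cycle."""
    d = 0.0        # register 330, initial value d[0] = 0
    cycles = 0
    for ai, bi in zip(a, b):
        product = ai * bi   # multiplier 310
        d = d + product     # adder 320; result written back to register 330
        cycles += 1
    return d, cycles

# Usage: a 3-tap dot product takes three cycles in this model.
result, cycles = mac([1, 2, 3], [4, 5, 6])
```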

FIG. 4 is a block diagram illustrating the operation of a MAAC kernel 
according to one embodiment of the present invention. The MAAC kernel can be 
described by the following recursive equation: 

$$d^{[i+1]} = d^{[i]} + a(i) * b(i) + c(i)$$ with initial value $d^{[0]} = 0$. 

MAAC kernel 405 includes multiplier 310, adder 320 and register 330. Multiplier 
310 performs multiplication of input datum a(i) and filter coefficient b(i), the result of 
which is passed to adder 320. Adder 320 adds the result of multiplier 310 to a second 
input term c(i) along with accumulated output $d^{[i]}$, which was previously stored in 
register 330. The output of adder 320 ($d^{[i+1]}$) is then stored in register 330. 

As an additional addition (c(i)) is performed each cycle, the MAAC kernel 
will have higher performance throughput for some classes of computations. For 
example, the throughput of a digital filter with some filter coefficients equal to one 
can be improved utilizing the MAAC architecture depicted in FIG. 4. 

FIG. 5 illustrates a paradigm for improving computational processes utilizing 
a MAAC kernel according to one embodiment of the present invention. In 510, an 
expression for a particular computation is determined. Typically, the computation is 
expressed as a linear combination of input elements a(i) scaled by a respective 
coefficient b(i). That is, the present invention provides for improved efficiency of 
performance for computational problems that may be expressed in the general form: 

$$d = \sum_{i=0}^{N-1} a(i) * b(i)$$

where a(i) are the input data, b(i) are coefficients and d is the output. As noted above, 
utilizing a traditional MAC architecture, output d may be computed utilizing a kernel 
of the form: 

$$d^{[i+1]} = d^{[i]} + a(i) * b(i)$$ with initial value $d^{[0]} = 0$. 

This type of computation occurs very frequently in many applications 
including digital signal processing, digital filtering, etc. 

In 520, a common factor 'c' is factored out of the expression, obtaining the 
following expression: 

$$d = c \sum_{i=0}^{N-1} a(i) * b'(i)$$ where $b(i) = c\,b'(i)$. 

If, as a result of factoring the common factor c, some of the coefficients b'(i) 
are unity, then the following result is obtained: 

$$d = c\left(\sum_{i=0}^{M-1} a(i) * b'(i) + \sum_{i=M}^{N-1} a(i)\right)$$ where $\{b'(i) = 1 : M \le i \le N-1\}$ 

This may be effected, for example, by factoring a matrix expression such that 
certain matrix entries are '1'. The above expression lends itself to use of the MAAC 
kernel described above by the recursive equation: 

$$d^{[i+1]} = d^{[i]} + a(i) * b(i) + c(i)$$ with initial value $d^{[0]} = 0$. 

In this form the computation utilizes at least one addition per cycle due to the unity 
coefficients. 

In step 530, based upon the re-expression of the computational process 
accomplished in step 510, one or more MAAC kernels are arranged in a configuration 
to carry out the computational process as represented in its re-expressed form of step 
520. 
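
The three steps of FIG. 5 can be sketched in software as follows. This is a minimal model, not the claimed hardware: the function name `maac_dot`, the tolerance used to detect unity coefficients, and the order in which multiplied and unity terms are paired are all assumptions made for illustration.

```python
def maac_dot(a, b, c, tol=1e-12):
    """Compute d = sum(a[i] * b[i]) with the MAAC recurrence.

    c is the common factor chosen in step 520; b'(i) = b(i) / c.
    Inputs whose b'(i) is unity ride along as the add term c(i).
    """
    b_prime = [bi / c for bi in b]
    mult_terms = [(ai, bp) for ai, bp in zip(a, b_prime) if abs(bp - 1.0) > tol]
    unity_terms = [ai for ai, bp in zip(a, b_prime) if abs(bp - 1.0) <= tol]

    d = 0.0       # accumulator register, initial value 0
    cycles = 0
    while mult_terms or unity_terms:
        ai, bp = mult_terms.pop(0) if mult_terms else (0.0, 0.0)
        ci = unity_terms.pop(0) if unity_terms else 0.0
        d = d + ai * bp + ci   # one MAAC cycle: multiply, add, add
        cycles += 1
    return c * d, cycles       # apply the factored-out c once at the end

# Usage: coefficients 0.5, 0.5, 1.5, 2.0 share the factor 0.5, leaving
# b' = 1, 1, 3, 4, so two of the four inputs need no multiplication.
value, cycles = maac_dot([1.0, 2.0, 3.0, 4.0], [0.5, 0.5, 1.5, 2.0], 0.5)
```

In this model the four-term dot product completes in two cycles, against four cycles for a plain MAC loop, because each cycle absorbs one unity-coefficient input through the extra adder input.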

The paradigm depicted in FIG. 5 is particularly useful for multiply and 
accumulate computational processes. According to one embodiment, described 
herein, the method of the present invention is applied to provide a more efficient 
IDCT computation, which is a multiply and accumulate process typically carried out 
using a plurality of MAC kernels. 

According to one embodiment, the present invention is applied to the IDCT in 
order to reduce computational complexity and improve efficiency. In particular, the 
number of clock cycles required in a particular hardware implementation to carry out 
the IDCT is reduced significantly. 

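The 2-D DCT may be expressed as follows: 

$$y_{kl} = \frac{2}{\sqrt{MN}}\, a(k)\, a(l) \sum_{i=0}^{M-1} \sum_{j=0}^{N-1} x_{ij} \cos\left(\frac{(2i+1)k\pi}{2M}\right) \cos\left(\frac{(2j+1)l\pi}{2N}\right)$$

where $a(k) = 1/\sqrt{2}$ if $k = 0$ and $a(k) = 1$ otherwise. 

The 2-D IDCT may be expressed as follows: 

$$x_{ij} = \frac{2}{\sqrt{MN}} \sum_{k=0}^{M-1} \sum_{l=0}^{N-1} a(k)\, a(l)\, y_{kl} \cos\left(\frac{(2i+1)k\pi}{2M}\right) \cos\left(\frac{(2j+1)l\pi}{2N}\right)$$

where $a(k) = 1/\sqrt{2}$ if $k = 0$ and $a(k) = 1$ otherwise. 

The 2-D DCT and IDCT are separable and may be factored as follows: 

$$x_{ij} = \sum_{k=0}^{M-1} z_{kj}\, e_i^M(k)$$ for i = 0, 1, ..., M-1 and j = 0, 1, ..., N-1 

where the temporal 1-D IDCT data are: 

$$z_{kj} = \sum_{l=0}^{N-1} y_{kl}\, e_j^N(l)$$ for k = 0, 1, ..., M-1 and j = 0, 1, ..., N-1 

and the DCT basis vectors $e_i^M(k)$ are: 

$$e_i^M(k) = \sqrt{\frac{2}{M}}\, a(k) \cos\left(\frac{(2i+1)k\pi}{2M}\right)$$ for i, k = 0, 1, ..., M-1 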

A fast algorithm for calculating the IDCT (Chen) capitalizes on the cyclic property of 
the transform basis function (the cosine function). For example, for an eight-point 
IDCT, the basis function only assumes 8 different positive and negative values as 
shown in the following table: 



j/l      0       1       2       3       4       5       6       7

0      c(0)    c(1)    c(2)    c(3)    c(4)    c(5)    c(6)    c(7)
1      c(0)    c(3)    c(6)   -c(7)   -c(4)   -c(1)   -c(2)   -c(5)
2      c(0)    c(5)   -c(6)   -c(1)   -c(4)    c(7)    c(2)    c(3)
3      c(0)    c(7)   -c(2)   -c(5)    c(4)    c(3)   -c(6)   -c(1)
4      c(0)   -c(7)   -c(2)    c(5)    c(4)   -c(3)   -c(6)    c(1)
5      c(0)   -c(5)   -c(6)    c(1)   -c(4)   -c(7)    c(2)   -c(3)
6      c(0)   -c(3)    c(6)    c(7)   -c(4)    c(1)   -c(2)    c(5)
7      c(0)   -c(1)    c(2)   -c(3)    c(4)   -c(5)    c(6)   -c(7)



where c(m) denotes the following basis terms: 

$$c(m) = a(m) \cos\left(\frac{m\pi}{16}\right)$$




The cyclical nature of the IDCT shown in the above table provides the following 
relationship between output terms of the 1-D IDCT: 

$$\frac{x_i + x_{7-i}}{2} = e_i(0) y_0 + e_i(2) y_2 + e_i(4) y_4 + e_i(6) y_6$$

$$\frac{x_i - x_{7-i}}{2} = e_i(1) y_1 + e_i(3) y_3 + e_i(5) y_5 + e_i(7) y_7$$

where the basis terms $e_i(k)$ have sign and value mapped to the DCT basis terms c(m) 
according to the relationship: 

$$e_i(k) = \pm\frac{1}{2}\, c(m(i,k))$$

For a 4-point IDCT, the basis terms also have the symmetrical property 
illustrated in the above table, as follows: 

j/l      0       1       2       3

0      c(0)    c(2)    c(4)    c(6)
1      c(0)    c(6)   -c(4)   -c(2)
2      c(0)   -c(6)   -c(4)    c(2)
3      c(0)   -c(2)    c(4)   -c(6)



The corresponding equations are: 

$$\frac{x_i + x_{3-i}}{2} = e_i(0) y_0 + e_i(4) y_2$$

$$\frac{x_i - x_{3-i}}{2} = e_i(2) y_1 + e_i(6) y_3$$

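Based upon the above derivation, a 1-D 8-point IDCT can be represented by 
the following matrix vector equation: 

$$\begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \end{bmatrix} = \frac{1}{2} A \begin{bmatrix} y_0 \\ y_4 \\ y_2 \\ y_6 \end{bmatrix} + \frac{1}{2} B \begin{bmatrix} y_1 \\ y_5 \\ y_3 \\ y_7 \end{bmatrix}, \qquad \begin{bmatrix} x_7 \\ x_6 \\ x_5 \\ x_4 \end{bmatrix} = \frac{1}{2} A \begin{bmatrix} y_0 \\ y_4 \\ y_2 \\ y_6 \end{bmatrix} - \frac{1}{2} B \begin{bmatrix} y_1 \\ y_5 \\ y_3 \\ y_7 \end{bmatrix}$$

where: 

$$A = \begin{bmatrix} c(0) & c(4) & c(2) & c(6) \\ c(0) & -c(4) & c(6) & -c(2) \\ c(0) & -c(4) & -c(6) & c(2) \\ c(0) & c(4) & -c(2) & -c(6) \end{bmatrix}, \qquad B = \begin{bmatrix} c(1) & c(5) & c(3) & c(7) \\ c(3) & -c(1) & -c(7) & -c(5) \\ c(5) & c(7) & -c(1) & c(3) \\ c(7) & c(3) & -c(5) & -c(1) \end{bmatrix}$$

and $c(0) = \cos\left(\frac{\pi}{4}\right)$ and $c(n) = \cos\left(\frac{n\pi}{16}\right)$ (n = 1, 2, 3, 4, 5, 6, 7). 

Note that $A^{-1} = \frac{1}{2} A^T$ and $B^{-1} = \frac{1}{2} B^T$. 

Using the paradigm depicted in FIG. 5, a common factor may be factored from 
the matrix equation above such that certain coefficients are unity. The unity 
coefficients then allow for the introduction of a number of MAAC kernels in a 
computational architecture, thereby reducing the number of clock cycles required to 
carry out the IDCT. In particular, by factoring $c(0) = c(4) = \frac{1}{\sqrt{2}}$ out from the matrix 
vector equation above, the following equation is obtained: 

$$\begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \end{bmatrix} = c(4) \left( \frac{1}{2} A' \begin{bmatrix} y_0 \\ y_4 \\ y_2 \\ y_6 \end{bmatrix} + \frac{1}{2} B' \begin{bmatrix} y_1 \\ y_5 \\ y_3 \\ y_7 \end{bmatrix} \right), \qquad \begin{bmatrix} x_7 \\ x_6 \\ x_5 \\ x_4 \end{bmatrix} = c(4) \left( \frac{1}{2} A' \begin{bmatrix} y_0 \\ y_4 \\ y_2 \\ y_6 \end{bmatrix} - \frac{1}{2} B' \begin{bmatrix} y_1 \\ y_5 \\ y_3 \\ y_7 \end{bmatrix} \right)$$

where: 

$$A' = \begin{bmatrix} 1 & 1 & c'(2) & c'(6) \\ 1 & -1 & c'(6) & -c'(2) \\ 1 & -1 & -c'(6) & c'(2) \\ 1 & 1 & -c'(2) & -c'(6) \end{bmatrix}, \qquad B' = \begin{bmatrix} c'(1) & c'(5) & c'(3) & c'(7) \\ c'(3) & -c'(1) & -c'(7) & -c'(5) \\ c'(5) & c'(7) & -c'(1) & c'(3) \\ c'(7) & c'(3) & -c'(5) & -c'(1) \end{bmatrix}$$

with $A = c(4) A'$, $B = c(4) B'$ and $c'(n) = c(n)/c(4)$. 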



Because the factor $\frac{1}{\sqrt{2}}$ is factored out of the matrix vector equation, the results after 
two-dimensional operations would carry a scale factor of two. Dividing the final 
result by 2 after the two-dimensional computation would result in the correct 
transform. 

Note that the expression for the IDCT derived above incorporates multiple 
instances of the generalized expression $d = \sum_{i=0}^{N-1} a(i) * b(i)$ re-expressed as 

$$d = c\left(\sum_{i=0}^{M-1} a(i) * b'(i) + \sum_{i=M}^{N-1} a(i)\right)$$ where $\{b'(i) = 1 : M \le i \le N-1\}$ 

to which the present invention is addressed. This is a consequence of the nature of 
matrix multiplication and may be seen as follows (unpacking the matrix multiplication): 

$$\begin{aligned}
x_0 &= y_0 + y_4 + c'(2)*y_2 + c'(6)*y_6 + c'(1)*y_1 + c'(5)*y_5 + c'(3)*y_3 + c'(7)*y_7 \\
x_1 &= y_0 - y_4 + c'(6)*y_2 - c'(2)*y_6 + c'(3)*y_1 - c'(1)*y_5 - c'(7)*y_3 - c'(5)*y_7 \\
x_2 &= y_0 - y_4 - c'(6)*y_2 + c'(2)*y_6 + c'(5)*y_1 + c'(7)*y_5 - c'(1)*y_3 + c'(3)*y_7 \\
x_3 &= y_0 + y_4 - c'(2)*y_2 - c'(6)*y_6 + c'(7)*y_1 + c'(3)*y_5 - c'(5)*y_3 - c'(1)*y_7 \\
x_7 &= y_0 + y_4 + c'(2)*y_2 + c'(6)*y_6 - c'(1)*y_1 - c'(5)*y_5 - c'(3)*y_3 - c'(7)*y_7 \\
x_6 &= y_0 - y_4 + c'(6)*y_2 - c'(2)*y_6 - c'(3)*y_1 + c'(1)*y_5 + c'(7)*y_3 + c'(5)*y_7 \\
x_5 &= y_0 - y_4 - c'(6)*y_2 + c'(2)*y_6 - c'(5)*y_1 - c'(7)*y_5 + c'(1)*y_3 - c'(3)*y_7 \\
x_4 &= y_0 + y_4 - c'(2)*y_2 - c'(6)*y_6 - c'(7)*y_1 - c'(3)*y_5 + c'(5)*y_3 + c'(1)*y_7
\end{aligned}$$

Note that the above expressions do not incorporate the scale factors 1/2, which can be 
computed at the end of the calculation simply as a right bit-shift. 
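
As a numerical cross-check, the factored matrix form can be compared against a direct evaluation of the 8-point IDCT. The sketch below is illustrative only: the helper names (`c`, `cp`, `idct8_factored`, `idct8_direct`) are hypothetical, `cp(n)` stands for c'(n) = c(n)/c(4), the matrix signs follow the basis-term table for the eight-point IDCT, and the scale factors c(4) and 1/2 are applied explicitly at the end.

```python
import math

def c(n):
    """Basis term c(n); c(0) equals c(4) = 1/sqrt(2)."""
    return 1 / math.sqrt(2) if n == 0 else math.cos(n * math.pi / 16)

def cp(n):
    """Scaled term c'(n) = c(n) / c(4)."""
    return c(n) / c(4)

A_PRIME = [
    [1,  1,  cp(2),  cp(6)],
    [1, -1,  cp(6), -cp(2)],
    [1, -1, -cp(6),  cp(2)],
    [1,  1, -cp(2), -cp(6)],
]
B_PRIME = [
    [cp(1),  cp(5),  cp(3),  cp(7)],
    [cp(3), -cp(1), -cp(7), -cp(5)],
    [cp(5),  cp(7), -cp(1),  cp(3)],
    [cp(7),  cp(3), -cp(5), -cp(1)],
]

def idct8_factored(y):
    even = [y[0], y[4], y[2], y[6]]
    odd = [y[1], y[5], y[3], y[7]]
    x = [0.0] * 8
    for i in range(4):
        ea = sum(A_PRIME[i][j] * even[j] for j in range(4))
        ob = sum(B_PRIME[i][j] * odd[j] for j in range(4))
        x[i] = c(4) * 0.5 * (ea + ob)       # outputs x0..x3
        x[7 - i] = c(4) * 0.5 * (ea - ob)   # outputs x7..x4
    return x

def idct8_direct(y):
    out = []
    for i in range(8):
        acc = 0.0
        for k in range(8):
            a = 1 / math.sqrt(2) if k == 0 else 1.0
            acc += a * y[k] * math.cos((2 * i + 1) * k * math.pi / 16)
        out.append(0.5 * acc)
    return out

# Usage: both routes should agree to rounding error for any input vector.
y = [10.0, -3.0, 4.0, 1.0, 0.5, -2.0, 7.0, 3.0]
max_err = max(abs(p - q) for p, q in zip(idct8_factored(y), idct8_direct(y)))
```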

FIG. 6 is a block diagram of a hardware architecture for computing an 
eight-point IDCT utilizing a MAAC kernel according to one embodiment of the present 
invention. The hardware architecture of FIG. 6 may be incorporated into a larger 
datapath for computation of an IDCT. As shown in FIG. 6, data loader 505 is coupled 
to four dual MAAC kernels 405(1)-405(4), each dual MAAC kernel including two 
MAAC kernels sharing a common multiplier. Note that the architecture depicted in 
FIG. 6 is merely illustrative and is not intended to limit the scope of the claims 
appended hereto. The operation of the hardware architecture depicted in FIG. 6 for 
computing the IDCT will become evident with respect to the following discussion. 

Utilizing the architecture depicted in FIG. 6, a 1-D 8-point IDCT can be 
computed in 5 clock cycles as follows: 

1st clock: 

mult1 = c'(1) * y_1 
mult2 = c'(3) * y_1 
mult3 = c'(5) * y_1 
mult4 = c'(7) * y_1 

x_0(clk1) = y_0 + mult1 + 0 
x_7(clk1) = y_0 - mult1 + 0 
x_1(clk1) = y_0 + mult2 + 0 
x_6(clk1) = y_0 - mult2 + 0 
x_2(clk1) = y_0 + mult3 + 0 
x_5(clk1) = y_0 - mult3 + 0 
x_3(clk1) = y_0 + mult4 + 0 
x_4(clk1) = y_0 - mult4 + 0 

2nd clock: 

mult1 = c'(5) * y_5 
mult2 = -c'(1) * y_5 
mult3 = c'(7) * y_5 
mult4 = c'(3) * y_5 

x_0(clk2) = y_4 + mult1 + x_0(clk1) 
x_7(clk2) = y_4 - mult1 + x_7(clk1) 
x_1(clk2) = -y_4 + mult2 + x_1(clk1) 
x_6(clk2) = -y_4 - mult2 + x_6(clk1) 
x_2(clk2) = -y_4 + mult3 + x_2(clk1) 
x_5(clk2) = -y_4 - mult3 + x_5(clk1) 
x_3(clk2) = y_4 + mult4 + x_3(clk1) 
x_4(clk2) = y_4 - mult4 + x_4(clk1) 

3rd clock: 

mult1 = c'(3) * y_3 
mult2 = -c'(7) * y_3 
mult3 = -c'(1) * y_3 
mult4 = -c'(5) * y_3 

x_0(clk3) = 0 + mult1 + x_0(clk2) 
x_7(clk3) = 0 - mult1 + x_7(clk2) 
x_1(clk3) = 0 + mult2 + x_1(clk2) 
x_6(clk3) = 0 - mult2 + x_6(clk2) 
x_2(clk3) = 0 + mult3 + x_2(clk2) 
x_5(clk3) = 0 - mult3 + x_5(clk2) 
x_3(clk3) = 0 + mult4 + x_3(clk2) 
x_4(clk3) = 0 - mult4 + x_4(clk2) 

4th clock: 

mult1 = c'(7) * y_7 
mult2 = -c'(5) * y_7 
mult3 = c'(3) * y_7 
mult4 = -c'(1) * y_7 

x_0(clk4) = 0 + mult1 + x_0(clk3) 
x_7(clk4) = 0 - mult1 + x_7(clk3) 
x_1(clk4) = 0 + mult2 + x_1(clk3) 
x_6(clk4) = 0 - mult2 + x_6(clk3) 
x_2(clk4) = 0 + mult3 + x_2(clk3) 
x_5(clk4) = 0 - mult3 + x_5(clk3) 
x_3(clk4) = 0 + mult4 + x_3(clk3) 
x_4(clk4) = 0 - mult4 + x_4(clk3) 

5th clock: 

mult1 = c'(2) * y_2 
mult2 = c'(6) * y_6 
mult3 = c'(6) * y_2 
mult4 = -c'(2) * y_6 

x_0(clk5) = mult1 + mult2 + x_0(clk4) 
x_7(clk5) = mult1 + mult2 + x_7(clk4) 
x_1(clk5) = mult3 + mult4 + x_1(clk4) 
x_6(clk5) = mult3 + mult4 + x_6(clk4) 
x_2(clk5) = -mult3 - mult4 + x_2(clk4) 
x_5(clk5) = -mult3 - mult4 + x_5(clk4) 
x_3(clk5) = -mult1 - mult2 + x_3(clk4) 
x_4(clk5) = -mult1 - mult2 + x_4(clk4) 
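
The five-clock schedule can be simulated in software to confirm that it accumulates the 8-point IDCT. The sketch below is a behavioral model only, with hypothetical names (`SCHEDULE`, `idct8_five_clocks`); each of the four multiplier outputs per clock is shared by an output pair (x_p, x_{7-p}), and the sign of the c'(7)*y_5 product in the second clock is taken from matrix B. The omitted scale factors c(4) and 1/2 are applied after the fifth clock.

```python
import math

def c(n):
    """Basis term c(n); c(0) equals c(4) = 1/sqrt(2)."""
    return 1 / math.sqrt(2) if n == 0 else math.cos(n * math.pi / 16)

def cp(n):
    """Scaled term c'(n) = c(n) / c(4)."""
    return c(n) / c(4)

# Clocks 1-4: (index of the y term fed to all four multipliers,
#              coefficients for mult1..mult4,
#              index of the y term used as the add input, or None,
#              sign of that add input for each output pair)
SCHEDULE = [
    (1, [cp(1),  cp(3),  cp(5),  cp(7)], 0,    [1,  1,  1, 1]),
    (5, [cp(5), -cp(1),  cp(7),  cp(3)], 4,    [1, -1, -1, 1]),
    (3, [cp(3), -cp(7), -cp(1), -cp(5)], None, [0,  0,  0, 0]),
    (7, [cp(7), -cp(5),  cp(3), -cp(1)], None, [0,  0,  0, 0]),
]

def idct8_five_clocks(y):
    x = [0.0] * 8                       # accumulator registers, all start at 0
    for y_idx, coefs, add_idx, signs in SCHEDULE:
        for p in range(4):              # pair p drives x[p] and x[7 - p]
            mult = coefs[p] * y[y_idx]
            add = signs[p] * y[add_idx] if add_idx is not None else 0.0
            x[p] += add + mult
            x[7 - p] += add - mult
    # 5th clock: the even terms y2 and y6 enter both outputs of a pair equally.
    m1, m2 = cp(2) * y[2], cp(6) * y[6]
    m3, m4 = cp(6) * y[2], -cp(2) * y[6]
    for p, even in enumerate([m1 + m2, m3 + m4, -m3 - m4, -m1 - m2]):
        x[p] += even
        x[7 - p] += even
    return [c(4) * 0.5 * v for v in x]  # restore the omitted scale factors

def idct8_direct(y):
    out = []
    for i in range(8):
        acc = 0.0
        for k in range(8):
            a = 1 / math.sqrt(2) if k == 0 else 1.0
            acc += a * y[k] * math.cos((2 * i + 1) * k * math.pi / 16)
        out.append(0.5 * acc)
    return out

# Usage: compare the scheduled result with the direct transform.
y = [10.0, -3.0, 4.0, 1.0, 0.5, -2.0, 7.0, 3.0]
max_err = max(abs(p - q) for p, q in zip(idct8_five_clocks(y), idct8_direct(y)))
```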

FIG. 7 is a block diagram of a datapath for computation of an 8-point IDCT 
utilizing the method of the present invention and a number of MAAC kernel 
components according to one embodiment of the present invention. Note that the 
datapath shown in FIG. 7 includes four dual MAAC kernels 405(1)-405(4). 

According to an alternative embodiment, the MAAC kernel is modified to 
include two additional additions, to produce a structure herein referred to as the 
AMAAC kernel. The AMAAC kernel can be described by the following recursive 
equation: 

$$d^{[i+1]} = d^{[i]} + [a(i) + e(i)] * b(i) + c(i)$$ with initial value $d^{[0]} = 0$. 

FIG. 8 is a block diagram illustrating the operation of an AMAAC kernel 
according to one embodiment of the present invention. The AMAAC kernel, as 
described below with reference to FIG. 9, provides a structure for achieving more 
efficient downsampling computations. AMAAC kernel 805 includes multiplier 310, 
first adder 320a, second adder 320b and register 330. First adder 320a adds a(i) and 
e(i). Multiplier 310 performs multiplication of input datum [a(i)+e(i)] and filter 
coefficient b(i), the result of which is passed to adder 320b. Adder 320b adds the 
result of multiplier 310 to a second input term c(i) along with accumulated output $d^{[i]}$, 
which was previously stored in register 330. The output of adder 320b ($d^{[i+1]}$) is then 
stored in register 330. 
As two more additions are performed during the same AMAAC cycle, the 
AMAAC kernel has a higher performance throughput for some classes of computations. 
For example, a digital filter in which some filter coefficients have equal values can 
be sped up by the AMAAC kernel. Specifically, a(i), c(i), and e(i) can be 
considered as input data and b(i) as filter coefficients. With inputs a(i) and e(i) having 
the same filter coefficients b(i) and inputs c(i) with unity coefficients, all three groups 
of inputs can be processed in parallel. 
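
A short software model may make the AMAAC recurrence concrete. The sketch below is illustrative only: the function name `amaac` and the example filter are assumptions, chosen so that two inputs share each coefficient and two further inputs have unity coefficients.

```python
def amaac(a, e, b, c):
    """Model of an AMAAC kernel: d <- d + (a + e) * b + c per cycle."""
    d = 0.0                        # register 330, initial value 0
    for ai, ei, bi, ci in zip(a, e, b, c):
        pre_sum = ai + ei          # first adder 320a
        product = pre_sum * bi     # multiplier 310
        d = d + product + ci       # second adder 320b, result to register 330
    return d

# Usage: a symmetric 6-tap filter with taps [0.25, 0.5, 1, 1, 0.5, 0.25]
# applied to samples [1, 2, 3, 4, 5, 6]. Samples 1 and 6 share the tap 0.25,
# samples 2 and 5 share the tap 0.5, and samples 3 and 4 have unity taps,
# so the whole dot product fits in two AMAAC cycles.
out = amaac(a=[1.0, 2.0], e=[6.0, 5.0], b=[0.25, 0.5], c=[3.0, 4.0])
```

A plain MAC loop would need six cycles for the same six taps, which is the kind of saving the passage above describes.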

The present invention may be applied to downsampling or filtering operations 
in general, not necessarily involving the IDCT. For example, the filtering of finite 
digital signals in the sample domain may be performed using convolution. A 
well-known circular convolution may be obtained, for example, by generating a periodic 
extension of the signal and then applying a filter by performing a circular convolution on 
the periodically extended signal and an appropriate filter. This may be efficiently 
performed in the DFT domain, for example, by obtaining a simple multiplication of 
the DFT coefficients of the signal and the DFT coefficients of the filter and then 
applying the inverse DFT to the result. For the DCT, a convolution may be applied 
that is related to, but different from, the DFT convolution. This is described, for 
example, in "Symmetric Convolution and the Discrete Sine and Cosine Transforms," 
by S. Martucci, IEEE Transactions on Signal Processing, Vol. 42, No. 5, May 1994, 
and includes a symmetric extension of the signal and filter, linear convolution, and 
applying a window to the result. 

For example, consider a 2-D signal and a 2-D filter. Assume that the 2-D signal 
is represented as $x_{i,j}$ with DCT coefficients $y_{k,l}$, where {i, k} are from 0 to M-1 
and {j, l} are from 0 to N-1, and that the 2-D filter is represented as $h_{p,q}$, 
where p ranges from 0 to P-1 and q ranges from 0 to Q-1. According to this example, 
filter $h_{p,q}$ may be a symmetric low pass even length filter with filter lengths P and Q, 
where P = 2M and Q = 2N: 

$$h_{p,q} = h_{2M-p-1,\,q} = h_{p,\,2N-q-1} = h_{2M-p-1,\,2N-q-1}$$ for p = 0, 1, ..., M-1 and q = 0, 1, ..., N-1. 

The DCT (frequency domain) coefficients $H_{k,l}$ for the filter $h_{p,q}$ may be obtained by 
applying a 2-D DCT to the fourth quadrant of the filter: 

«.-^)^)I|w J ^^H^) to *- w •■■^- , ■ rf '- ,u •••••' , - , • 

Then the filtering with respect to a particular sample in the inverse transform domain 
is performed by element-by-element multiplication of the signal DCT coefficients, 
$y_{k,l}$, and the filter DCT coefficients, $H_{k,l}$, and taking the appropriate inverse DCT 
transform of the DCT-domain multiplication results: 

$$Y_{k,l} = H_{k,l} \cdot y_{k,l}$$ for k = 0, 1, ..., M-1 and l = 0, 1, ..., N-1. 

Downsampling may be performed in the DCT domain. For example, downsampling 
by two (2:1) in the horizontal direction may be performed by taking the 
element-by-element multiplication of a signal that has been filtered, for example, according to the 
relationship: 

$$\bar{Y}_{k,l} = \frac{1}{\sqrt{2}} \left( Y_{k,l} - Y_{k,N-l} \right)$$ for k = 0, 1, ..., M-1 and l = 0, 1, ..., N/2-1. 

Similarly, 2:1 downsampling in the vertical direction can also be performed in the 
DCT domain. 

The decimated signal is then obtained by applying the inverse DCT transform 
of length N/2 to the folded coefficients given by the relationship above. There are 
several special cases that might be usefully applied in 
this embodiment, although the invention is not limited in scope in this respect. For 
example, a brickwall filter with coefficients [1 1 1 1 0 0 0 0] in the DCT domain may 
be implemented that can further simplify the 8-point DCT domain downsampling by 
two operation. Specifically, the special filter shape avoids folding and addition. 
Another filter with coefficients [1 1 1 1 0.5 0 0 0] provides a transfer function of an 
antialiasing filter for the 2:1 operation. Other filters may also be employed, of course. 

In order to map such a filtered downsampling operation to an AMAAC 
computation kernel, the element-by-element multiplication operation in the DCT domain 
can be incorporated in the inverse quantization block. Specifically, the filter DCT 
coefficients, $H_{k,l}$, can be combined together with the inverse quantization coefficients. 
Subsequently, the output of the IQ block is the filtered DCT coefficients, $Y_{k,l}$, of the 
signal. 

FIG. 9 is a block diagram of a reconfigurable downsampling computation engine 
according to one embodiment of the present invention. As shown in FIG. 9, a 
plurality of adders and multipliers 910 are configured via switching fabric 905 to 
operate in either a non-downsampling mode 920 or a downsampling mode 930. 

According to one embodiment, to operate in non-downsampling mode 920, 
adders and multipliers 910 are configured via switching fabric 905 as a plurality of 
MAAC kernels 405(1)-405(N). To operate in downsampling mode 930, adders and 
multipliers 910 are configured via switching fabric 905 as a plurality of MAAC 
kernels 405(1)-405(N) and a plurality of AMAAC kernels 805(1)-805(N). 

According to one embodiment, MAAC and AMAAC computational kernels 
may be combined to generate a reconfigurable computation engine (for example, to 
compute the IDCT). By allowing this reconfiguration, hardware logic gates can be 
shared to improve performance without incurring additional cost. 

For example, a typical algorithm for computing a 1-D IDCT with 2:1
downsampling is expressed as follows:

\[
\begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \end{bmatrix}
= c(4) \cdot A' \cdot
\begin{bmatrix} y_0 - y_7 \\ y_2 - y_5 \\ y_1 - y_6 \\ y_3 - y_4 \end{bmatrix}
\]

where

\[
A' = \begin{bmatrix}
1 & 1 & c'(2) & c'(6) \\
1 & -1 & c'(6) & -c'(2) \\
1 & -1 & -c'(6) & c'(2) \\
1 & 1 & -c'(2) & -c'(6)
\end{bmatrix}
\]
Note that for the downsampling operation, compared with a non-downsampling 1-D
IDCT, addition is applied to the input data {y0, y1, ..., y7} first, followed by
multiplication with the coefficients c'. Since in the first path the input DCT
coefficients arrive in a zig-zag order, for a given column the 1-D DCT coefficients
arrive serially but interleaved with varying numbers of coefficients in other columns.
If the downsampling operation is directly implemented using conventional hardware,
there will be a significant number of idle cycles (bubbles) because of the random
order of the arriving y coefficients. This may result in pipeline stalls in a
conventional hardware architecture.
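
The interleaved arrival described above can be illustrated with a short sketch. It assumes the conventional zig-zag scan used by JPEG/MPEG-style coders and treats a "column" as a fixed column index of the 8x8 block; the specification does not give its scan table, so the function names and the exact scan are assumptions for illustration only.

```python
def zigzag_order(n=8):
    """Return (row, col) positions of an n x n block in conventional zig-zag order."""
    positions = [(r, c) for r in range(n) for c in range(n)]
    # Walk the anti-diagonals in turn; odd diagonals run top-to-bottom,
    # even diagonals run bottom-to-top.
    return sorted(
        positions,
        key=lambda p: (p[0] + p[1], p[0] if (p[0] + p[1]) % 2 else -p[0]),
    )


def column_arrival_slots(col, n=8):
    """Scan indices (arrival slots) at which the coefficients of one column arrive."""
    return [slot for slot, (_, c) in enumerate(zigzag_order(n)) if c == col]
```

Under this assumed scan, `column_arrival_slots(0)` returns `[0, 2, 3, 9, 10, 20, 21, 35]`: the eight coefficients of the first column are separated by gaps of varying length, which is why hardware that must wait for a pair of inputs before adding them would sit idle.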

According to one embodiment, a reconfigurable hardware architecture is
realized by performing multiplication operations on the input y terms arriving serially
first, followed by additions. This ordering may be realized upon examination of the
downsampling operation in expanded form as follows:

\[
\begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \end{bmatrix}
= c(4) \cdot
\begin{bmatrix}
y_0 - y_7 + y_2 - y_5 + c'(2) y_1 - c'(2) y_6 + c'(6) y_3 - c'(6) y_4 \\
y_0 - y_7 - y_2 + y_5 + c'(6) y_1 - c'(6) y_6 - c'(2) y_3 + c'(2) y_4 \\
y_0 - y_7 - y_2 + y_5 - c'(6) y_1 + c'(6) y_6 + c'(2) y_3 - c'(2) y_4 \\
y_0 - y_7 + y_2 - y_5 - c'(2) y_1 + c'(2) y_6 - c'(6) y_3 + c'(6) y_4
\end{bmatrix}
\]
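
The equivalence of the two orderings (add first, then multiply; or multiply each serially arriving term first, then accumulate) can be checked numerically with the sketch below. It assumes c(k) = cos(k*pi/16) and c'(k) = c(k)/c(4), which is the usual normalization but is not restated in this section; the function names are hypothetical.

```python
import math


def c_prime(k):
    """Assumed definition: c'(k) = c(k) / c(4), with c(k) = cos(k * pi / 16)."""
    return math.cos(k * math.pi / 16) / math.cos(4 * math.pi / 16)


C4 = math.cos(math.pi / 4)
C2, C6 = c_prime(2), c_prime(6)
A_PRIME = [
    [1, 1, C2, C6],
    [1, -1, C6, -C2],
    [1, -1, -C6, C2],
    [1, 1, -C2, -C6],
]


def downsample_add_first(y):
    """Matrix form: form the four differences first, then multiply by A'."""
    d = [y[0] - y[7], y[2] - y[5], y[1] - y[6], y[3] - y[4]]
    return [C4 * sum(a * v for a, v in zip(row, d)) for row in A_PRIME]


def downsample_multiply_first(y):
    """Expanded form: scale each arriving y term, then accumulate."""
    # Column of A' used by each input index, and the sign that input carries.
    column = {0: 0, 7: 0, 2: 1, 5: 1, 1: 2, 6: 2, 3: 3, 4: 3}
    sign = {0: 1, 7: -1, 2: 1, 5: -1, 1: 1, 6: -1, 3: 1, 4: -1}
    acc = [0.0] * 4
    for k, yk in enumerate(y):  # the sums do not depend on arrival order
        for n in range(4):
            acc[n] += sign[k] * A_PRIME[n][column[k]] * yk
    return [C4 * a for a in acc]
```

In the multiply-first form each input contributes to the accumulators independently of the others, so a coefficient can be consumed as soon as it arrives rather than waiting for its partner in a difference such as y0 - y7.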



In the expanded vector equation above, note that multiplication operations are applied
to the input y terms arriving serially first, followed by additions. According to one
embodiment, in 2:1 downsampling mode, the higher order coefficients (y5, y6 and y7)
may be zeroed without causing significant degradation to the output video quality.
This follows from the energy compaction property of the DCT.

Zeroing the higher order coefficients, the following expression is obtained:

\[
\begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \end{bmatrix}
= c(4) \cdot A' \cdot
\begin{bmatrix} y_0 \\ y_2 \\ y_1 \\ y_3 - y_4 \end{bmatrix}
\]

By expanding the above matrix equation, the following relationship is obtained:

x0 = y0 + y2 + c'(2) * y1 + c'(6) * (y3 - y4)
x1 = y0 - y2 + c'(6) * y1 + (-c'(2)) * (y3 - y4)
x2 = y0 - y2 + (-c'(6)) * y1 + c'(2) * (y3 - y4)
x3 = y0 + y2 + (-c'(2)) * y1 + (-c'(6)) * (y3 - y4)

According to one embodiment of the present invention, the above equations
are realized according to a hardware embodiment depicted in FIG. 10. Referring to
FIG. 10, in the downsampling mode, adders and multipliers are configured as MAAC
kernels 405(1) (multiplier 310(2) and adder 320(6)) and 405(2) (multiplier 310(2) and adder
320(4)) and an AMAAC kernel 805 (adders 320(1), 320(3) and multiplier 310(4)).
Comparing the downsampling configuration shown in FIG. 10 with the
non-downsampling configuration shown in FIG. 6, note that the four multipliers and eight
adders utilized in the non-downsampling mode are also shown in the downsampling
mode operation. However, note that four of the adders are utilized as shared adders,
specifically shared adder 320(5), computing y0 - y2, shared adder 320(2), computing
y0 + y2, and shared adder 320(1), computing y3 - y4. Note that adder 320(7) is not utilized
in the downsampling configuration.

FIG. 11a is a block diagram illustrating a datapath for computing a first path
of an eight-point 2-D IDCT in a non-downsampling mode (also called the 1h1v mode,
as the scaling ratios along both the horizontal (h) direction and the vertical (v) direction are
1:1) according to one embodiment of the present invention. The hardware
architecture shown in FIG. 11a improves IDCT computation by simultaneously
processing addition and multiply terms.

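As derived above, an 8-point IDCT may be expressed as follows (with the common
scale factor 1/2 * c(4) factored out of each path):

\[
\begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \end{bmatrix}
= A' \begin{bmatrix} y_0 \\ y_4 \\ y_2 \\ y_6 \end{bmatrix}
+ B' \begin{bmatrix} y_1 \\ y_5 \\ y_3 \\ y_7 \end{bmatrix},
\qquad
\begin{bmatrix} x_7 \\ x_6 \\ x_5 \\ x_4 \end{bmatrix}
= A' \begin{bmatrix} y_0 \\ y_4 \\ y_2 \\ y_6 \end{bmatrix}
- B' \begin{bmatrix} y_1 \\ y_5 \\ y_3 \\ y_7 \end{bmatrix}
\]

where

\[
A' = \begin{bmatrix}
1 & 1 & c'(2) & c'(6) \\
1 & -1 & c'(6) & -c'(2) \\
1 & -1 & -c'(6) & c'(2) \\
1 & 1 & -c'(2) & -c'(6)
\end{bmatrix},
\qquad
B' = \begin{bmatrix}
c'(1) & c'(5) & c'(3) & c'(7) \\
c'(3) & -c'(1) & -c'(7) & -c'(5) \\
c'(5) & c'(7) & -c'(1) & c'(3) \\
c'(7) & c'(3) & -c'(5) & -c'(1)
\end{bmatrix}
\]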
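
As a numerical check of this even/odd decomposition, the sketch below evaluates it and compares the result with a direct cosine-sum inverse DCT. It assumes c(k) = cos(k*pi/16), c'(k) = c(k)/c(4), an orthonormal 8-point IDCT as the reference, and an overall scale of 1/2 * c(4); these definitions and the function names are assumptions made for the example.

```python
import math


def c_prime(k):
    """Assumed definition: c'(k) = c(k) / c(4), with c(k) = cos(k * pi / 16)."""
    return math.cos(k * math.pi / 16) / math.cos(4 * math.pi / 16)


C4 = math.cos(math.pi / 4)
c1, c2, c3, c5, c6, c7 = (c_prime(k) for k in (1, 2, 3, 5, 6, 7))
A_PRIME = [[1, 1, c2, c6], [1, -1, c6, -c2], [1, -1, -c6, c2], [1, 1, -c2, -c6]]
B_PRIME = [[c1, c5, c3, c7], [c3, -c1, -c7, -c5], [c5, c7, -c1, c3], [c7, c3, -c5, -c1]]


def idct8_even_odd(y):
    """8-point IDCT via x[n] = s*(A'e + B'o) and x[7-n] = s*(A'e - B'o)."""
    even = [y[0], y[4], y[2], y[6]]
    odd = [y[1], y[5], y[3], y[7]]
    scale = 0.5 * C4  # common factor, applied here instead of being deferred
    x = [0.0] * 8
    for n in range(4):
        a = sum(coef * v for coef, v in zip(A_PRIME[n], even))
        b = sum(coef * v for coef, v in zip(B_PRIME[n], odd))
        x[n] = scale * (a + b)
        x[7 - n] = scale * (a - b)
    return x


def idct8_direct(y):
    """Reference: orthonormal 8-point inverse DCT written as a cosine sum."""
    out = []
    for n in range(8):
        total = C4 * y[0]
        total += sum(y[k] * math.cos((2 * n + 1) * k * math.pi / 16) for k in range(1, 8))
        out.append(0.5 * total)
    return out
```

The y0 and y4 inputs meet only unit coefficients in A', which is consistent with their being treated as addition terms, while the remaining inputs are always scaled by some c'(k).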



As shown in FIG. 11a, the first path of the 1h1v mode is configured to include four
dual MAAC kernels 405(1)-405(4). Note that this configuration corresponds directly
to FIG. 6. The four dual MAAC kernels 405(1)-405(4) allow simultaneous
processing of addition and multiply terms. According to the above equation, IQ block
130 generates coefficients that must be processed either by performing a
multiplication or an addition. The utilization of the MAAC kernels in the datapath
shown in FIG. 11a (see FIG. 6) allows simultaneous performance of the multiplication
and addition, improving performance.

A portion of the architecture shown in FIG. 11a is responsible for
demultiplexing addition and multiply terms received from IQ block 130.
In particular, coefficients from IQ block 130 (not shown) are received through the IQ
interface, where they are demultiplexed onto node 1125 (add_term) or node 1130
(multiply_term) depending upon whether the coefficient is a multiply or addition
term.

Multiply terms are then passed from node 1130 to one of multipliers 310(a)-310(d)
via combinational logic. Similarly, addition terms are passed from node 1125
to one of adders 320(a)-320(h). In particular, in the 1h1v first path, y0 and y4 are
addition terms while y1, y2, y3, y5, y6 and y7 are multiply terms. Note that these
terms may be utilized immediately upon generation from IQ block 130. That is, a
bubble is not introduced into the pipeline while waiting for coefficients from IQ block
130.
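
The demultiplexing step can be sketched in software as a simple routing decision on the coefficient index. The queue representation and the function name below are hypothetical and only model the behavior described in the text; the split between addition terms (y0, y4) and multiply terms is taken from the paragraph above.

```python
# From the text: in the 1h1v first path, y0 and y4 are addition terms.
ADD_TERM_INDICES = {0, 4}


def demultiplex(coefficients):
    """Route (index, value) pairs, in arrival order, to add-term or multiply-term queues."""
    add_terms, multiply_terms = [], []
    for index, value in coefficients:
        if index in ADD_TERM_INDICES:
            add_terms.append((index, value))       # models node 1125 (add_term)
        else:
            multiply_terms.append((index, value))  # models node 1130 (multiply_term)
    return add_terms, multiply_terms
```

Each coefficient is routed as soon as it arrives and never waits on another coefficient, which mirrors the statement that no bubble is introduced into the pipeline.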

The intermediate output terms of the first path of the IDCT are stored in
transport storage unit 1105 (TRAM), where they await processing by the second path
of the IDCT computation.

FIG. 11b is a block diagram illustrating a datapath for computing a second
path of an eight-point 2-D IDCT in a non-downsampling mode according to one
embodiment of the present invention. Note that the operative equation for the second
path is identical to that presented above for the first path (see FIG. 11a and
accompanying text). The configuration for the second 1h1v path corresponds
directly to FIG. 6. Thus, similar to the first path, four dual MAAC kernels
405(1)-405(4) allow simultaneous processing of addition and multiply terms. According to
the above equation, IQ block 130 generates coefficients that must be processed either
by performing a multiplication or an addition. The utilization of the MAAC kernels
in the datapath shown in FIG. 11b (see FIG. 6) allows simultaneous performance of
the multiplication and addition, improving performance.

With respect to the data flow, the only difference is that the y terms in the
equation originate from TRAM 1105 and the calculated x terms are output to
motion compensation block 150. Note that upon completion of the 1h1v 2nd path, three
right shifts are performed due to the initial factoring of 1/2 * c(4) in the first and
second paths. Thus, at the end, three right shifts, corresponding to
1/2 * c(4) * 1/2 * c(4) = 1/2 * 1/2 * 1/2, are required in order to obtain the final
correct 2-D IDCT results.
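
The arithmetic behind the shift count can be verified with a small sketch. It assumes c(4) = cos(pi/4) = 1/sqrt(2) and models the deferred scaling as a right shift of an integer accumulator; the function names are hypothetical.

```python
import math

C4 = math.cos(math.pi / 4)  # assumed definition: c(4) = cos(pi/4) = 1/sqrt(2)


def deferred_shift_count(per_path_factor, paths=2):
    """Number of right shifts equivalent to the scale factored out of all paths."""
    total = per_path_factor ** paths
    shifts = round(-math.log2(total))
    if not math.isclose(2.0 ** -shifts, total, rel_tol=1e-12):
        raise ValueError("deferred scale is not a power of two")
    return shifts


def apply_deferred_scale(accumulator, shifts):
    """Apply the deferred scale to an integer accumulator with right shifts."""
    return accumulator >> shifts
```

With a per-path factor of 1/2 * c(4), the product over two paths is 1/8, so `deferred_shift_count(0.5 * C4)` gives three shifts; with c(4) alone factored out of each path, the product is 1/2 and a single shift suffices.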

FIG. 12a is a block diagram illustrating a datapath for computing a first path
of an eight-point to four-point 2-D IDCT in a downsampling mode (also called the
2h2v mode, as the scaling ratios along both the horizontal (h) direction and the vertical (v)
direction are 2:1) according to one embodiment of the present invention. Recall, as
derived above, computation of a 2:1 downsampled eight-point IDCT may be
expressed as follows:

\[
\begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \end{bmatrix}
= c(4) \cdot
\begin{bmatrix}
y_0 - y_7 + y_2 - y_5 + c'(2) y_1 - c'(2) y_6 + c'(6) y_3 - c'(6) y_4 \\
y_0 - y_7 - y_2 + y_5 + c'(6) y_1 - c'(6) y_6 - c'(2) y_3 + c'(2) y_4 \\
y_0 - y_7 - y_2 + y_5 - c'(6) y_1 + c'(6) y_6 + c'(2) y_3 - c'(2) y_4 \\
y_0 - y_7 + y_2 - y_5 - c'(2) y_1 + c'(2) y_6 - c'(6) y_3 + c'(6) y_4
\end{bmatrix}
\]
As shown in FIG. 12a, the first path of the 2h2v mode is configured to include
MAAC kernels 405(1)-405(4). Note that this configuration corresponds directly to
FIG. 6. The use of four MAAC kernels 405(1)-405(4) allows simultaneous
processing of addition and multiply terms. According to the above equation, IQ block
130 generates coefficients that must be processed either by performing a
multiplication or an addition. The utilization of the MAAC kernels 405(1)-405(4) in
the datapath shown in FIG. 12a (see FIG. 6) allows simultaneous performance of the
multiplication and addition, improving performance.

In particular, in the 2h2v first path, y0, y7, y2 and y5 are involved in an
addition operation and y1, y3, y4 and y6 are involved in a multiply operation.

Using the above architecture, it will take 4 clocks for one column of coefficients to finish
the first path of a 2-D 2h2v IDCT. For an 8x8 block with all non-zero coefficients, it
will take 4 x 8 = 32 clocks to finish the 1-D IDCT. Because the mode is 2h2v and 2:1
downsampling is performed in the vertical direction, only four terms are
output per operation. Thus, although the architecture includes 4 multipliers and 8
adders in the block diagram of FIG. 12a, only 4 multipliers and 4 adders are actually
involved in the computation in the first path.

FIG. 12b is a block diagram illustrating a datapath for computing a second
path of an eight-point to four-point 2-D IDCT in a downsampling mode according to
one embodiment of the present invention. In this path, one AMAAC kernel 805 is
utilized along with three additional addition computations. Note that the AMAAC
kernel 805 utilizes adder 320(a) in a shared configuration. In the 2h2v 2nd path, 7
adders and 4 multipliers are utilized.

Also note that in the 2nd path of 2h2v, all the y inputs (y0, y1, ..., y4) originate
from TRAM 1105, and thus all terms are available simultaneously. Further, note that
the equation cited above for the first path 2h2v is also operative for the second path.
By rewriting the above matrix equation as

x0 = y0 + y2 + c'(2) * y1 + c'(6) * (y3 - y4)
x1 = y0 - y2 + c'(6) * y1 + (-c'(2)) * (y3 - y4)
x2 = y0 - y2 + (-c'(6)) * y1 + c'(2) * (y3 - y4)
x3 = y0 + y2 + (-c'(2)) * y1 + (-c'(6)) * (y3 - y4)

it can be seen that the other adders 320(a)-320(c) may be used to calculate:

add1 = y3 - y4
add2 = y0 - y2
add3 = y0 + y2

and 4 multipliers 310(a)-310(d) can be used to calculate:

mult1 = c'(2) * (y3 - y4) = c'(2) * add1
mult2 = c'(6) * y1
mult3 = c'(2) * y1
mult4 = (-c'(6)) * (y3 - y4) = (-c'(6)) * add1

In the final stage, the following state is obtained: 

x0 = add3 + mult3 + (-mult4)
x1 = add2 + mult2 + (-mult1)
x2 = add2 + (-mult2) + mult1
x3 = add3 + (-mult3) + mult4
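
The three stages of the second path (shared additions, multiplications, final accumulation) can be sketched as follows. The sketch assumes c'(k) = cos(k*pi/16) / cos(pi/4) and leaves out the deferred c(4) * c(4) scaling; the function names are hypothetical. Here mult3 is computed as c'(2) * y1, which is the value x0 and x3 require in the rewritten equations.

```python
import math


def c_prime(k):
    """Assumed definition: c'(k) = c(k) / c(4), with c(k) = cos(k * pi / 16)."""
    return math.cos(k * math.pi / 16) / math.cos(4 * math.pi / 16)


def second_path_2h2v(y0, y1, y2, y3, y4):
    """Staged evaluation of x0..x3 (before the final deferred scaling)."""
    c2, c6 = c_prime(2), c_prime(6)
    # Stage 1: three additions, each shared by two outputs.
    add1 = y3 - y4
    add2 = y0 - y2
    add3 = y0 + y2
    # Stage 2: four multiplications.
    mult1 = c2 * add1
    mult2 = c6 * y1
    mult3 = c2 * y1
    mult4 = -c6 * add1
    # Stage 3: final accumulation.
    return [
        add3 + mult3 - mult4,
        add2 + mult2 - mult1,
        add2 - mult2 + mult1,
        add3 - mult3 + mult4,
    ]


def second_path_direct(y0, y1, y2, y3, y4):
    """Reference: the four rewritten equations evaluated directly."""
    c2, c6 = c_prime(2), c_prime(6)
    return [
        y0 + y2 + c2 * y1 + c6 * (y3 - y4),
        y0 - y2 + c6 * y1 - c2 * (y3 - y4),
        y0 - y2 - c6 * y1 + c2 * (y3 - y4),
        y0 + y2 - c2 * y1 - c6 * (y3 - y4),
    ]
```

Since every input is available at once from the TRAM, the three stages have no data-arrival dependencies beyond their own ordering, which is what allows them to be evaluated back to back in hardware.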

Thus, for the second path in 2h2v mode, each 1-D row IDCT can be completed
in just one cycle. This improved throughput in the IDCT stage matches very well with the
improved throughput in motion compensation unit 150 that follows.

At the completion of the 2h2v 2nd path, a right shift is necessary because c(4)
was factored out in both the first path and the second path, and therefore the
multiplication by c(4) * c(4) = 1/2 must be performed at the end.

According to one embodiment, the MAAC and AMAAC operations may be
incorporated into the instruction set of a CPU or DSP processor that incorporates one
or more MAAC and/or AMAAC kernels. This would allow compilation of a source program
to directly take advantage of these hardware structures for signal processing
operations.






